Final Project Worksheet - Rakamin Academy Data Science Batch V

Kelompok: Astro Boys

Dataset : Health Insurance

Anggota:

  • Robertsen Putra Sugianto
  • Tossy Adhahir Rukmana Rauf
  • Afiqi Ilman Pasha
  • Bintang Adi Kusuma
  • Arry Averrus Adhiputra

1. Load Dataset

Library yang digunakan

In [ ]:
pip install category_encoders -q
     |████████████████████████████████| 81kB 3.4MB/s eta 0:00:011
In [ ]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import category_encoders as ce

# Feature extraction
from imblearn import under_sampling, over_sampling
from sklearn.preprocessing import MinMaxScaler, StandardScaler

Load Dataset

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
path = '/content/drive/My Drive/AstroBoys_Notebook/data/'

df_train = pd.read_csv(path + 'train.csv')
In [ ]:
df_train.head()
Out[ ]:
id Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage Response
0 1 Male 44 1 28.0 0 > 2 Years Yes 40454.0 26.0 217 1
1 2 Male 76 1 3.0 0 1-2 Year No 33536.0 26.0 183 0
2 3 Male 47 1 28.0 0 > 2 Years Yes 38294.0 26.0 27 1
3 4 Male 21 1 11.0 1 < 1 Year No 28619.0 152.0 203 0
4 5 Female 29 1 41.0 1 < 1 Year No 27496.0 152.0 39 0

2. Data Exploration

Describe Data

In [ ]:
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381109 entries, 0 to 381108
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    381109 non-null  int64  
 1   Gender                381109 non-null  object 
 2   Age                   381109 non-null  int64  
 3   Driving_License       381109 non-null  int64  
 4   Region_Code           381109 non-null  float64
 5   Previously_Insured    381109 non-null  int64  
 6   Vehicle_Age           381109 non-null  object 
 7   Vehicle_Damage        381109 non-null  object 
 8   Annual_Premium        381109 non-null  float64
 9   Policy_Sales_Channel  381109 non-null  float64
 10  Vintage               381109 non-null  int64  
 11  Response              381109 non-null  int64  
dtypes: float64(3), int64(6), object(3)
memory usage: 34.9+ MB

Pada penggambaran informasi data di atas, terlihat bahwa total terdapat total 12 kolom, 381.109 baris data, serta tidak ada data yang hilang. Dengan 3 kolom diantaranya bersifat kategoris (Gender, Vehicle_Age, Vehicle_Damage), dan 8 kolom sisanya bersifat numerik.

Definisi kolom:

  1. id: Nomor identitas unik untuk setiap pelanggan
  2. Gender: Jenis kelamin pelanggan
  3. Age: Umur pelanggan
  4. Driving_License: Apakah pelanggan memiliki SIM atau tidak (Nilai 0: Tidak punya SIM, Nilai 1: Punya SIM)
  5. Region_Code: Kode pos unik untuk setiap wilayah tempat tinggal pelanggan
  6. Previously_Insured: Apakah pelanggan sebelumnya pernah memiliki Asuransi Kendaraan (Nilai 0: Tidak pernah punya Asuransi, Nilai 1: Pernah punya Asuransi)
  7. Vehicle_Age: Umur kendaraan pelanggan
  8. Vehicle_Damage: Apakah kendaraan pelanggan sebelumnya pernah rusak (Nilai 0: Tidak pernah rusak, Nilai 1: Pernah rusak)
  9. Annual_Premium: Jumlah biaya premi yang harus dibayarkan pelanggan setiap tahun
  10. Policy_Sales_Channel: Kode anonim untuk setiap jalur penawaran ke pelanggan, contohnya melalui Agen, Surat, Telepon, Secara Langsung, dll.)
  11. Vintage: Jumlah hari, seberapa lama pelanggan terasosiasikan dengan perusahaan
  12. Response: Apakah pelanggan tertarik dengan Asuransi Kendaraan (Nilai 0: Tidak tertarik, Nilai 1: Tertarik)
In [ ]:
df_train.isnull().sum().reset_index()
Out[ ]:
index 0
0 id 0
1 Gender 0
2 Age 0
3 Driving_License 0
4 Region_Code 0
5 Previously_Insured 0
6 Vehicle_Age 0
7 Vehicle_Damage 0
8 Annual_Premium 0
9 Policy_Sales_Channel 0
10 Vintage 0
11 Response 0

Pada tabel data di atas, terlihat bahwa dari semua 11 kolom yang tersedia, tidak ada satupun yang memiliki data Null (hilang).

Hanya 12.256336113815209 % dari semua populasi di mana response bernilai 1

Pendekatan Numerik

In [ ]:
df_train.describe()
Out[ ]:
id Age Driving_License Region_Code Previously_Insured Annual_Premium Policy_Sales_Channel Vintage Response
count 381109.000000 381109.000000 381109.000000 381109.000000 381109.000000 381109.000000 381109.000000 381109.000000 381109.000000
mean 190555.000000 38.822584 0.997869 26.388807 0.458210 30564.389581 112.034295 154.347397 0.122563
std 110016.836208 15.511611 0.046110 13.229888 0.498251 17213.155057 54.203995 83.671304 0.327936
min 1.000000 20.000000 0.000000 0.000000 0.000000 2630.000000 1.000000 10.000000 0.000000
25% 95278.000000 25.000000 1.000000 15.000000 0.000000 24405.000000 29.000000 82.000000 0.000000
50% 190555.000000 36.000000 1.000000 28.000000 0.000000 31669.000000 133.000000 154.000000 0.000000
75% 285832.000000 49.000000 1.000000 35.000000 1.000000 39400.000000 152.000000 227.000000 0.000000
max 381109.000000 85.000000 1.000000 52.000000 1.000000 540165.000000 163.000000 299.000000 1.000000

Pada penggambaran data numerik di atas, terlihat bahwa secara umum, tidak ada data yang secara statistik terlihat aneh, terkecuali kolom Annual_Premium yang memiliki nilai maksimum yang sangat besar dan berbeda jauh dengan nilai minimum. Masukan datanya pun terlihat sama, yakni 381.109 masukan data.

In [ ]:
df_train[['Gender','Vehicle_Age','Vehicle_Damage']].describe()
Out[ ]:
Gender Vehicle_Age Vehicle_Damage
count 381109 381109 381109
unique 2 3 2
top Male 1-2 Year Yes
freq 206089 200316 192413

Pada penggambaran di atas, tampak bahwa data kategoris bersifat wajar. Dengan kolom Gender, memiliki 2 nilai unik, dengan modus yaitu kelompok data Male, dengan kemunculan sebanyak 206.089 kali. Di sisi lain, kolom Vehicle_Age, memiliki 2 nilai unik, dengan modus yaitu kelompok data 1-2 Year, dengan frekuensi nilai sebanyak 200.316. Pada kolom Gender, terlihat bahwa kolom tersebut memiliki 2 nilai unik, dengan modus yaitu kelompok data Yes, dengan kemunculan sebanyak 192.413 kali. Masukan datanya pun terlihat sama, yakni 381.109 masukan data.

Pendekatan Grafis: Univariate Analysis

In [ ]:
features1a=['Age','Driving_License','Region_Code','Previously_Insured','Annual_Premium','Policy_Sales_Channel','Vintage','Response']
plt.figure(figsize=(12,20))
for i in range(0,len(features1a)):
    plt.subplot(6,9,i+1)
    sns.boxplot(y=df_train[features1a[i]],color='green',orient='v')
    plt.tight_layout()

Dari grafik yang dihasilkan, terdapat Outliers yang cukup banyak untuk kolom Annual_Premium, dengan jumlah cukup yang besar dan jauh

In [ ]:
plt.figure(figsize=(12,20))
for i in range(0,len(features1a)):
    plt.subplot(6,9,i+1)
    sns.violinplot(y=df_train[features1a[i]],color='blue',orient='v')
    plt.tight_layout()
In [ ]:
data_num1=df_train[features1a]
k=len(data_num1.columns)
n=3
m=(k-1)//n+1
fig,axes=plt.subplots(m,n,figsize=(n*5,m*3))
for i,(name,col) in enumerate(data_num1.iteritems()):
    r,c=i//n,i%n
    ax=axes[r,c]
    col.hist(ax=ax,color='green')
    ax2=col.plot.kde(ax=ax,secondary_y=True,title=name,color='red')
    ax2.set_ylim(0)

fig.tight_layout()

Berdasarkan grafik di atas, terlihat bahwa distribusi Age bersifat Positively Skewed. Response pun juga cukup timpang

In [ ]:
features1b=['Gender','Vehicle_Age','Vehicle_Damage']
plt.figure(figsize=(10,4))
for i in range(0,len(features1b)):
    plt.subplot(1,3,i+1)
    sns.countplot(x=df_train[features1b[i]],color='green',orient='v')
    plt.tight_layout()

Dari grafik itu, tampak bahwa Vehicle_Damage memiliki distribusi data yang cukup seimbang. Sementara itu, untuk Vehicle_Age cukup timpang, dengan kategori nilai >2 Years kalah jauh dengan yang lain.

Pendekatan Grafis: Multivariate Analysis

In [ ]:
corr_=df_train[features1a].corr()
plt.figure(figsize=(16,10))
sns.heatmap(corr_,annot=True,fmt=".2f",cmap="BuPu");

Dari grafik Heatmap tersebut, sejauh ini tidak tampak kolom yang memiliki korelasi kuat (nilai > 0.7)

In [ ]:
sns.pairplot(df_train[features1a],
             diag_kind='kde',
             plot_kws={'alpha':0.6,'s':80,'edgecolor':'k','color':'green'},
             size=4);

plt.tight_layout()
In [ ]:
sns.pairplot(df_train[features1a],
             diag_kind='kde',hue='Response',
             plot_kws={'alpha':0.6,'s':80,'edgecolor':'k','color':'green'},
             size=4);

plt.tight_layout()
In [ ]:
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,7))
g=sns.catplot(x='Vehicle_Damage',y='Annual_Premium',hue='Response',data=df_train,ax=ax1)
g=sns.catplot(x='Vehicle_Damage',y='Annual_Premium',hue='Response',kind='swarm',data=df_train,ax=ax2)
In [ ]:
fig,(ax1,ax2)=plt.subplots(nrows=1,ncols=2,figsize=(20,8))
g=sns.countplot('Gender',hue='Response',data=df_train,ax=ax1,palette='husl')
ax1.set_title('Response Rate by Gender')

g=sns.barplot(x='Gender',y='Response',data=df_train,ax=ax2)
ax2.set_title('Response Rate by Gender')
ax2.set_xlabel('Gender')
ax2.set_ylabel('Response Probability')
Out[ ]:
Text(0, 0.5, 'Response Probability')
In [ ]:
fig,(ax1,ax2)=plt.subplots(nrows=1,ncols=2,figsize=(20,8))
g=sns.countplot('Vehicle_Age',hue='Response',data=df_train,ax=ax1,palette='husl')
ax1.set_title('Response Rate by Vehicle Age')

g=sns.barplot(x='Vehicle_Age',y='Response',data=df_train,ax=ax2)
ax2.set_title('Response Rate by Vehicle Age')
ax2.set_xlabel('Vehicle_Age')
ax2.set_ylabel('Response Probability')
Out[ ]:
Text(0, 0.5, 'Response Probability')
In [ ]:
fig,(ax1,ax2)=plt.subplots(nrows=1,ncols=2,figsize=(20,8))
g=sns.countplot('Vehicle_Damage',hue='Response',data=df_train,ax=ax1,palette='husl')
ax1.set_title('Response Rate by Vehicle Damage')

g=sns.barplot(x='Vehicle_Damage',y='Response',data=df_train,ax=ax2)
ax2.set_title('Response Rate by Vehicle Damage')
ax2.set_xlabel('Vehicle_Damage')
ax2.set_ylabel('Response Probability')
Out[ ]:
Text(0, 0.5, 'Response Probability')

3. Data Cleaning

Cek Missing Value

In [ ]:
data_missing_value = df_train.isnull().sum().reset_index()
data_missing_value
Out[ ]:
index 0
0 id 0
1 Gender 0
2 Age 0
3 Driving_License 0
4 Region_Code 0
5 Previously_Insured 0
6 Vehicle_Age 0
7 Vehicle_Damage 0
8 Annual_Premium 0
9 Policy_Sales_Channel 0
10 Vintage 0
11 Response 0
In [ ]:
df_train.isnull().sum()
Out[ ]:
id                      0
Gender                  0
Age                     0
Driving_License         0
Region_Code             0
Previously_Insured      0
Vehicle_Age             0
Vehicle_Damage          0
Annual_Premium          0
Policy_Sales_Channel    0
Vintage                 0
Response                0
dtype: int64

Cek Duplikat

In [ ]:
df_train.duplicated().sum()
Out[ ]:
0

Persebaran data analysis dengan boxplot

In [ ]:
features =  ['Age','Driving_License','Region_Code','Previously_Insured','Annual_Premium','Policy_Sales_Channel','Vintage','Response']
plt.figure(figsize=(12,20))
for i in range(0, len(features)):
    plt.subplot(6,9,i+1)
    sns.boxplot(y = df_train[features[i]],color='Navy',orient='v')
    plt.tight_layout()

Dapat dilihat pada variabel Annual_Premium terdapat outlier yang sangat banyak

Normalization

In [ ]:
data_tes = df_train
In [ ]:
f,ax = plt.subplots(2,2,figsize=(18,15))

g = sns.distplot(data_tes['Annual_Premium'],kde=True, ax=ax[0,0])
ax[0,0].set_title('Annual_Premium - Original')
ax[0,0].set_xlabel('')

g = sns.boxplot(data_tes['Annual_Premium'],color='green',orient='h', ax=ax[0,1])
ax[0,1].set_title('Annual_Premium - Original')
ax[0,1].set_xlabel('')

g = sns.distplot(np.log1p(data_tes['Annual_Premium']+1),kde=True, ax=ax[1,0])
ax[1,0].set_title('Annual_Premium - log transformation')
ax[1,0].set_xlabel('')

g = sns.boxplot(np.log1p(data_tes['Annual_Premium']+1),color='green',orient='h', ax=ax[1,1])
ax[1,1].set_title('Annual_Premium - log transformation')
ax[1,1].set_xlabel('')
Out[ ]:
Text(0.5, 0, '')

Mendeteksi & Membuang outlier

In [ ]:
Q1 = data_tes['Annual_Premium'].quantile(0.25)
Q3 = data_tes['Annual_Premium'].quantile(0.75)
IQR = Q3 - Q1
low_limit = Q1 - (1.5 * IQR)
high_limit = Q3 + (1.5 * IQR)
filtered_entries = ((data_tes['Annual_Premium'] >= low_limit) & (data_tes['Annual_Premium'] <= high_limit))
data_tes = data_tes[filtered_entries]
In [ ]:
data_tes.shape
Out[ ]:
(370789, 12)
In [ ]:
features11 =  ['Age','Driving_License','Region_Code','Previously_Insured','Annual_Premium','Policy_Sales_Channel','Vintage','Response']
plt.figure(figsize=(12,20))
for i in range(0, len(features11)):
    plt.subplot(6,9,i+1)
    sns.boxplot(y = data_tes[features11[i]],color='Red',orient='v')
    plt.tight_layout()

4. Feature Engineering

Merging Features

In [ ]:
df_merge = data_tes
merged_value = ['> 2 Years', '1-2 Year']
df_merge['Vehicle_Age'] = np.where(df_merge['Vehicle_Age'].isin(merged_value), '> 1 Year', '< 1 Year')

Standardization / Normailization

In [ ]:
print(df_merge.shape)
print(df_train.shape)
(370789, 12)
(381109, 12)
In [ ]:
df_merge.head()
Out[ ]:
id Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage Response
0 1 Male 44 1 28.0 0 2 Yes 40454.0 26.0 217 1
1 2 Male 76 1 3.0 0 1 No 33536.0 26.0 183 0
2 3 Male 47 1 28.0 0 2 Yes 38294.0 26.0 27 1
3 4 Male 21 1 11.0 1 0 No 28619.0 152.0 203 0
4 5 Female 29 1 41.0 1 0 No 27496.0 152.0 39 0
In [ ]:
# Fungsi normalisasi atau standardisasi
def normalize_standardize(data, op = 'standardize'):
  if (op == 'standardize'):
    sc_data = StandardScaler().fit_transform(data.values.reshape(len(data),1 ))
    return sc_data
  elif (op == 'normalize'):
    sc_data = StandardScaler().MinMaxScaler().fit_transform(data.values.reshape(len(data),1 ))
    return sc_data
  else:
    print("Operasi yang dimasukan bukan 'normalze' atau 'standardize'. Silakan coba lagi...")
    return 0
In [ ]:
# Pisahin antara numerik yang mau distandardisasi, ngilangin id, driving_license, previously_insured, dan response
numerical_column = ['Age', 'Annual_Premium', 'Policy_Sales_Channel','Vintage' ]

# Pisahin kolom string (objek)
object_column = list(df_train.select_dtypes(include = ['object']).columns)

# Bikin dataframe baru biar ga harus ulang dari awal kalau error
df_std = df_merge

# Standardisasi setiap kolom
for feature in numerical_column:
  df_std[feature] = normalize_standardize(df_std[feature], 'standardize')

# Tampilkan data setelah standardisasi
df_std
Out[ ]:
id Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage Response
0 1 Male 0.345182 1 28.0 0 > 1 Year Yes 0.758959 -1.601474 0.748826 1
1 2 Male 2.417701 1 3.0 0 > 1 Year No 0.289720 -1.601474 0.342470 0
2 3 Male 0.539480 1 28.0 0 > 1 Year Yes 0.612449 -1.601474 -1.521990 1
3 4 Male -1.144442 1 11.0 1 < 1 Year No -0.043793 0.730152 0.581503 0
4 5 Female -0.626312 1 41.0 1 < 1 Year No -0.119965 0.730152 -1.378570 0
... ... ... ... ... ... ... ... ... ... ... ... ...
381104 381105 Male 2.288169 1 26.0 1 > 1 Year No 0.061409 -1.601474 -0.792938 0
381105 381106 Male -0.561545 1 37.0 1 < 1 Year No 0.729250 0.730152 -0.279017 0
381106 381107 Male -1.144442 1 30.0 1 < 1 Year No 0.397025 0.878192 0.079533 0
381107 381108 Female 1.899571 1 14.0 0 > 1 Year Yes 1.041329 0.212013 -0.960262 0
381108 381109 Male 0.474714 1 29.0 0 > 1 Year No 0.848696 -1.601474 0.987859 0

370789 rows × 12 columns

Encoding

In [ ]:
# Buat dataframe baru
df_encoded = df_std
object_column = list(df_std.select_dtypes(include = ['object']).columns)

print("Shape before encoding:",df_encoded.shape)
print("Column to be encoded:",object_column)

# One Hot Encoding
for feature in object_column:
  dummies = pd.get_dummies(df_encoded[feature], prefix=feature, drop_first = True)
  # Append ke dataframe awal
  df_encoded = pd.concat([df_encoded, dummies], axis=1)

print("Shape after encoding:", df_encoded.shape)

# Hapus kolom yang sudah di encoding
df_encoded = df_encoded.drop(object_column,axis= 1)
print("Shape after dropping column:", df_encoded.shape)
df_encoded.head()
Shape before encoding: (370789, 12)
Column to be encoded: ['Gender', 'Vehicle_Age', 'Vehicle_Damage']
Shape after encoding: (370789, 15)
Shape after dropping column: (370789, 12)
Out[ ]:
id Age Driving_License Region_Code Previously_Insured Annual_Premium Policy_Sales_Channel Vintage Response Gender_Male Vehicle_Age_> 1 Year Vehicle_Damage_Yes
0 1 0.345182 1 28.0 0 0.758959 -1.601474 0.748826 1 1 1 1
1 2 2.417701 1 3.0 0 0.289720 -1.601474 0.342470 0 1 1 0
2 3 0.539480 1 28.0 0 0.612449 -1.601474 -1.521990 1 1 1 1
3 4 -1.144442 1 11.0 1 -0.043793 0.730152 0.581503 0 1 0 0
4 5 -0.626312 1 41.0 1 -0.119965 0.730152 -1.378570 0 0 0 0
In [ ]:
# Change Region Code type to string 
df_encoded['Region_Code'] = df_encoded['Region_Code'].astype(str)
print('Ada',df_encoded['Region_Code'].value_counts().count(), 'unique value pada Region_Code yang perlu diencode')

# Encoding with Binary Encoder
rc_encoded = ce.BinaryEncoder().fit_transform(df_encoded['Region_Code'])
print('Jumlah kolom yang terbentuk dari proses encoding region code:', rc_encoded.shape[1])
df_encoded = pd.concat([df_encoded, rc_encoded], axis = 1)
df_encoded = df_encoded.drop(['Region_Code', 'Region_Code_0'], axis =1)
df_encoded.head()
Ada 53 unique value pada Region_Code yang perlu diencode
Jumlah kolom yang terbentuk dari proses encoding region code: 7
Out[ ]:
id Age Driving_License Previously_Insured Annual_Premium Policy_Sales_Channel Vintage Response Gender_Male Vehicle_Age_> 1 Year Vehicle_Damage_Yes Region_Code_1 Region_Code_2 Region_Code_3 Region_Code_4 Region_Code_5 Region_Code_6
0 1 0.345182 1 0 0.758959 -1.601474 0.748826 1 1 1 1 0 0 0 0 0 1
1 2 2.417701 1 0 0.289720 -1.601474 0.342470 0 1 1 0 0 0 0 0 1 0
2 3 0.539480 1 0 0.612449 -1.601474 -1.521990 1 1 1 1 0 0 0 0 0 1
3 4 -1.144442 1 1 -0.043793 0.730152 0.581503 0 1 0 0 0 0 0 0 1 1
4 5 -0.626312 1 1 -0.119965 0.730152 -1.378570 0 0 0 0 0 0 0 1 0 0

Class Imbalance Handling

In [ ]:
# Pesebaran sebelum handling class imbalance
sns.countplot(x = 'Response', data = df_encoded)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa84dc0ddd8>
In [ ]:
from imblearn import under_sampling, over_sampling

# Pemisahan fitur dan target
X = df_encoded.drop(['Response'],axis=1)
x_columns = list(X.columns)
y = df_encoded['Response']

# Under sampling
X_under, y_under = under_sampling.RandomUnderSampler(random_state=42).fit_resample(X, y)

# Mengubah numpy array ke Dataframe agar bisa diconcat
X_under = pd.DataFrame(X_under)
y_under = pd.DataFrame(y_under)

# Rename lagi biar namanya ga 0,1,2,3,4,.... 
X_under.columns = x_columns
y_under = y_under.rename(columns = {0: 'Response'})

# Concat dan buat dataframe baru
df_under = pd.concat([X_under,y_under], axis = 1)

# Hapus id
df_under = df_under.drop('id',axis =1)

df_under.head()
Out[ ]:
Age Driving_License Previously_Insured Annual_Premium Policy_Sales_Channel Vintage Gender_Male Vehicle_Age_> 1 Year Vehicle_Damage_Yes Region_Code_1 Region_Code_2 Region_Code_3 Region_Code_4 Region_Code_5 Region_Code_6 Response
0 -1.144442 1.0 1.0 0.083521 0.730152 0.007823 0.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 1.0 0
1 -0.885377 1.0 1.0 0.443623 0.730152 1.382264 0.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 1.0 0
2 0.863312 1.0 1.0 -0.163036 -1.601474 -0.589760 1.0 1.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0
3 -0.691078 1.0 1.0 1.992083 0.730152 -1.521990 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 0
4 0.345182 1.0 0.0 0.270253 0.175003 0.569551 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0
In [ ]:
# Setelah class imbalance
sns.countplot(x = 'Response', data = df_under)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa84c448198>
In [ ]:
print(df_under['Response'].value_counts())
print(df_under.shape)
1    45155
0    45155
Name: Response, dtype: int64
(90310, 16)

Penentuan Feature

In [ ]:
# Buat df baru
df_final = df_under

features = list(df_final.columns)
corr_= df_final[features].corr()
plt.figure(figsize=(16,10))
sns.heatmap(corr_, annot=True, fmt = ".2f", cmap = "BuPu")
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f80bd34e2b0>

Pada tahap awal, akan digunakan keseluruhan fiturnya terlebih dahulu (hanya dengan menghapus id)

Untuk tahap selanjutnya apabila performa model kurang baik, akan digunakan fitur demikian:

  • Age
  • Previously_Insurance
  • Policy_Sales_Channel
  • Vehicle_Age
  • Vehicle_Damage

Dan membuang:

  • id
  • Driving Lcense
  • Region_Code
  • Annual_Premium
  • Vintage
  • Gender

STAGE 2

0.LOAD DATA AWAL

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
path = '/content/drive/My Drive/AstroBoys_Notebook/data/'

df= pd.read_csv(path + 'train.csv')
In [ ]:
df.head()
Out[ ]:
id Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage Response
0 1 Male 44 1 28.0 0 > 2 Years Yes 40454.0 26.0 217 1
1 2 Male 76 1 3.0 0 1-2 Year No 33536.0 26.0 183 0
2 3 Male 47 1 28.0 0 > 2 Years Yes 38294.0 26.0 27 1
3 4 Male 21 1 11.0 1 < 1 Year No 28619.0 152.0 203 0
4 5 Female 29 1 41.0 1 < 1 Year No 27496.0 152.0 39 0
In [ ]:
 

1. INSIGHT

In [ ]:
plt.figure(figsize=(10,8))
sns.countplot('Vehicle_Damage', hue ='Response', data = df)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel(xlabel = 'Is owner vehicle has even damaged?',fontsize=15)
plt.ylabel(ylabel = 'Number of People',fontsize=15)
plt.legend(title = 'Get insurance', labels = ['No','Yes'], fontsize = 12)
df_insight_damage = df.groupby(['Vehicle_Damage','Response'])['id'].count().reset_index().rename(columns={'id' : 'count'})
res_insight_damage = list(df_insight_damage['Response'])
dam_insight_damage = list(df_insight_damage['Vehicle_Damage'])
count_insight_damage = list(df_insight_damage['count'])
for i in range(0,len(res_insight_damage)):
    plt.text(x = (0 if dam_insight_damage[i] == 'Yes' else 1) + (-0.3 if res_insight_damage[i]%2 == 0 else 0.13) 
             , y = count_insight_damage[i] +3000
             , s=str(count_insight_damage[i])
             , fontsize=13 
             , fontweight='bold')

plt.text(x =-0.8, y= 230000, s = 'People who had their car damaged most like take the insurance', fontweight = 'bold', fontsize = 18)
plt.text(x =-0.8, y= 220000, s = 'The bad experience that someone has with their vehicle will make people think ', fontsize = 14)
plt.text(x =-0.8, y= 210000, s = 'more about taking out insurance', fontsize = 14)
Out[ ]:
Text(-0.8, 210000, 'more about taking out insurance')
In [ ]:
df_insight_region = df[df['Response'] == 1].groupby('Region_Code').count().reset_index().sort_values('Response', ascending = False).head(5)
df_insight_region['Region_Code'] = df_insight_region['Region_Code'].astype(int)

plt.figure(figsize = (10,8))
sns.barplot(x = 'Region_Code', y = 'Response', data =df_insight_region)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel(xlabel = 'Region Code',fontsize=15)
plt.ylabel(ylabel = 'Count',fontsize=15)

# df_insight_region
cnt_insight_region = list(df_insight_region.sort_values('Region_Code')['Response'])
reg_insight_region = list(df_insight_region.sort_values('Region_Code')['Region_Code'])

for i in range(0,len(cnt_insight_region)):
    plt.text(x = i -0.2
             , y = cnt_insight_region[i] + 300
             , s=str(cnt_insight_region[i])
             , fontsize=13 
             , fontweight='bold')
    
plt.text(x =-1, y= 22500, s = 'People who are in region code 28 tend to choose to use insurance', fontweight = 'bold', fontsize = 18)
plt.text(x =-1, y= 21500, s = 'Region 28 is the largest contributor to people using insurance ', fontsize = 14)
plt.text(x =2.5, y= 20000, s = '*data is taken from people who are',fontstyle = 'italic', fontsize = 12)
plt.text(x =2.9, y= 19300, s = 'confirmed to take insurance',fontstyle = 'italic', fontsize = 12)
Out[ ]:
Text(2.9, 19300, 'confirmed to take insurance')
In [ ]:
df_insight_region_0 = df[df['Response'] == 0].groupby('Region_Code').count().reset_index().sort_values('Response', ascending = False).head(10)
reg_insight_region_0 = list(df_insight_region_0.sort_values('Region_Code')['Region_Code'])
df_insight_region_1 = df[df['Response'] == 1].groupby('Region_Code').count().reset_index().sort_values('Response', ascending = False).head(10)
reg_insight_region_1 = list(df_insight_region_1.sort_values('Region_Code')['Region_Code'])

df_car_damage_0 = df[(df['Response'] == 0) & (df['Region_Code'].isin(reg_insight_region_0))]
df_car_damage_1 = df[(df['Response'] == 1) & (df['Region_Code'].isin(reg_insight_region_1))]

df_merge_car_damage = pd.concat([df_car_damage_0,df_car_damage_1])
df_merge_car_damage['Region_Code'] = df_merge_car_damage['Region_Code'].astype(int)

plt.figure(figsize=(15,12))
sns.countplot(y ='Region_Code', hue = 'Response', data= df_merge_car_damage)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.ylabel(ylabel = 'Region Code',fontsize=15)
plt.xlabel(xlabel = 'Count',fontsize=15)
plt.grid()

df_nol = df_merge_car_damage[df_merge_car_damage['Response'] == 0]['Region_Code'].value_counts().reset_index()
xlabel_reg_0 = list(df_nol['index'])
ylabel_reg_0 = list(df_nol['Region_Code'])
df_satu= df_merge_car_damage[df_merge_car_damage['Response'] == 1]['Region_Code'].value_counts().reset_index()
xlabel_reg_1 = list(df_satu['index'])
ylabel_reg_1 = list(df_satu['Region_Code'])

x_percentage = list(df_merge_car_damage.sort_values('Region_Code')['Region_Code'].unique())
y_percentage = []
y_real_value = []

print(x_percentage)
def search_list(list, x):
    for i in range(0,len(list)):
        if(list[i] == x):
            return True, i
    return False, -1

for i in x_percentage:
    res1, id1 = search_list(xlabel_reg_0, i)
    res2, id2 = search_list(xlabel_reg_1, i)
    if (res1 and res2):
        y_percentage.append(100*(ylabel_reg_1[id2]/ylabel_reg_0[id1]))
        y_real_value.append(ylabel_reg_1[id2])
    else:
        y_percentage.append(0)
        if res2:
            y_real_value.append(ylabel_reg_1[id2])
        else:
             y_real_value.append(0)

for i in range(0, len(x_percentage)):
     plt.text(y = i + 0.3
             , x = y_real_value[i] + 300
             , s=str(round(y_percentage[i],2)) + '%'
             , fontsize=12)

plt.text(x =0, y= -1.3, s = 'Opportunities for people in region 28 to take part in the insurance program are 23.03%', fontweight = 'bold', fontsize = 18)
plt.text(x =0, y= -1, s = 'Region code 28 is the region that contributes the most to our vehicle ', fontsize = 14)
plt.text(x =0, y= -0.7, s = 'insurance customers, followed by region code 29', fontsize = 14)
[3, 8, 11, 15, 28, 29, 30, 35, 36, 41, 46, 50]
Out[ ]:
Text(0, -0.7, 'insurance customers, followed by region code 29')
In [ ]:
df_insight_region = df[df['Vehicle_Damage'] == 'Yes'].groupby('Region_Code').count().reset_index().sort_values('Vehicle_Damage', ascending = False).head(5)
df_insight_region
fig, ax = plt.subplots(figsize = (12,8))
sns.barplot(x = 'Region_Code', y = 'Vehicle_Damage', data =df_insight_region, ax = ax)
plt.xticks(ticks = [0,1,2,3,4], labels = ['Rajasthan', 'Mizoram', 'Tamil Nadu', 'Ladakh', 'Kerala'], fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel(xlabel = 'Region Name',fontsize=15)
plt.ylabel(ylabel = 'Count of Damaged Vehicle',fontsize=15)

# df_insight_region
cnt_insight_region = list(df_insight_region.sort_values('Region_Code')['Response'])
reg_insight_region = list(df_insight_region.sort_values('Region_Code')['Region_Code'])

for i in range(0,len(cnt_insight_region)):
    plt.text(x = i -0.3
             , y = cnt_insight_region[i] + 300
             , s=str(cnt_insight_region[i])
             , fontsize=18 
             , fontweight='bold')
    
plt.text(x =-0.6, y= 80000, s = 'The increase in the number of insurance users in region 28', fontweight = 'bold', fontsize = 22)
plt.text(x =0.2, y= 76000, s = '(Tamil Nadu) was due to many damaged cars', fontweight = 'bold', fontsize = 22)

from matplotlib.patches import Rectangle
import matplotlib.patches as patches
ax.add_patch(Rectangle((1.45, 0), 1.1, 75000, fill=True, facecolor ='red', alpha=0.1))
ax.add_patch(Rectangle((1.45, 0), 1.1, 75000, fill=None, edgecolor='red', alpha=1, linestyle = '--', linewidth = 2))
Out[ ]:
<matplotlib.patches.Rectangle at 0x7f62e1d70080>
In [ ]:
fig, ax = plt.subplots(figsize = (12,8))
sns.distplot(df[df['Response'] == 0]['Age'],hist = False, kde_kws ={"lw" :3}, ax = ax)
sns.distplot(df[df['Response'] == 1]['Age'],hist = False, kde_kws ={"lw" :3}, ax = ax)
plt.legend(title = 'Response', labels = ['No', 'Yes'], fontsize = 12)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel(xlabel = 'Age',fontsize=15)
plt.ylabel(ylabel = 'PDF',fontsize=15)

from matplotlib.patches import Rectangle
import matplotlib.patches as patches
ax.add_patch(Rectangle((30, 0), 32, 0.07, fill=True, facecolor ='red', alpha=0.1))
ax.add_patch(Rectangle((30, 0), 32, 0.07, fill=None, edgecolor='red', alpha=1, linestyle = '--', linewidth = 2))

plt.text(x =20, y= 0.073, s = 'The age range of 30-62 years is the age range in which ', fontweight = 'bold', fontsize = 18)
plt.text(x =35, y= 0.07, s = 'it is possible to take out insurance', fontweight = 'bold', fontsize = 18)

countmax_age = df_train[df_train['Response'] == 1].groupby('Age').count().reset_index().sort_values('id', ascending = False).iloc[0,1]
ax.annotate('Maximum at age 44', xy=(44, 0.038), xytext=(62, 0.05),
            arrowprops=dict(facecolor='black', shrink=0.05), fontsize = 18)
plt.text(x =65, y= 0.047, s = 'with ' + str(countmax_age) + ' people', fontsize = 18)
Out[ ]:
Text(65, 0.047, 'with 1811 people')
In [ ]:
fig, ax = plt.subplots(figsize=(12,8))
sns.countplot('Response', data = df_train, ax = ax)
plt.xticks(ticks = [0, 1], labels = ['No','Yes'], fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel(xlabel = 'Response',fontsize=15)
plt.ylabel(ylabel = 'Count',fontsize=15)

percentage_response = []
count = []
count.append(df['Response'].value_counts().reset_index()['Response'][0])
count.append(df['Response'].value_counts().reset_index()['Response'][1])
total = sum(count)
percentage_response.append(count[0]/(count[0]+count[1]) * 100)
percentage_response.append(count[1]/(count[0]+count[1]) * 100)

for i in range(0,len(percentage_response)):
    plt.text(x = i - 0.10
             , y = count[i] - 20000
             , s=str(round(count[i]))
             , fontsize=18 
             , fontweight='bold'
             , color = 'white')
    plt.text(x = i - 0.15
             , y = count[i] + 3000
             , s=str(round(percentage_response[i], 4)) + '%'
             , fontsize=22 
             , fontweight='bold')
plt.text(x =1.1, y= 335000, s = 'Total: ' + str(total), fontsize = 18)

plt.text(x =-0.6, y= 400000, s = 'There are only about 12.25% of people who are willing', fontweight = 'bold', fontsize = 24)
plt.text(x =-0.6, y= 375000, s = 'to take an offer of vehicle insurance', fontweight = 'bold', fontsize = 24)

plt.savefig('responsecount.png', bbox_inches = 'tight')
In [ ]:
fig, ax=plt.subplots(figsize=(12,8))
sns.countplot('Vehicle_Age',hue='Response',data=df_train ,ax=ax,palette='husl', order=["< 1 Year", "1-2 Year", "> 2 Years"])
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel(xlabel = 'Vehicle Age',fontsize=15)
plt.ylabel(ylabel = 'Count',fontsize=15)

plt.text(x =-0.9, y= 188000, s = 'In general, customers with older vehicles also prefer a interested', fontweight = 'bold', fontsize = 22)
plt.text(x =-0.9, y= 179000, s = 'response, compared to owners with younger vehicles', fontweight = 'bold', fontsize = 22)
plt.legend(title = 'Get insurance', labels = ['No','Yes'], fontsize = 12)

df_total = df_train.groupby('Vehicle_Age').count().reset_index()
total = []
total.append(df_total.iloc[1,1])
total.append(df_total.iloc[0,1])
total.append(df_total.iloc[2,1])

df_year = df_train[df_train['Response'] == 1].groupby('Vehicle_Age').count().reset_index()
percentage_year = []
percentage_year.append(df_year.iloc[1,1]*100/total[0])
percentage_year.append(df_year.iloc[0,1]*100/total[1])
percentage_year.append(df_year.iloc[2,1]*100/total[2])


count = []
count.append(df_year.iloc[1,1])
count.append(df_year.iloc[0,1])
count.append(df_year.iloc[2,1])
for i in range(0,len(percentage_year)):
    plt.text(x = i 
             , y = count[i] + 3000
             , s=str(round(percentage_year[i],4)) + '%'
             , fontsize=20
             , fontweight='bold')

Insight: Mayoritas Pelanggan yang memiliki kendaraan dengan umur lebih tua (1-2 tahun & >2 tahun) tampak lebih responsif dengan penawaran Asuransi Kendaraan. Secara umum, pelanggan dengan kendaraan tua juga lebih memilih respon 'tertarik', dibandingkan dengan kendaraan muda

Insights & Tips Summary:

  1. Mayoritas Pelanggan yang memiliki kendaraan dengan umur lebih tua (1-2 tahun & >2 tahun) tampak lebih responsif dengan penawaran Asuransi Kendaraan. Secara umum, pelanggan dengan kendaraan tua juga lebih memilih respon 'tertarik', dibandingkan dengan kendaraan muda
  2. Orang-orang pada region code 28 merupakan tempat yang cocok untuk meningkatkan jumlah pelanggan
  3. Tingkat penerimaan asuransi pada region code 28 paling tinggi, yaitu sampai menyentuh 23,03% disusul dengan Region Code 29, lalu 41, dan 11
  4. Banyaknya orang yang mengasuransikan kendaraanya di region 28 dikarenakan orang-orang pada region 28 sudah pernah merasakan kendaraannya rusak.
  5. Potensial buyer adalah rentang usia 30 sampai 64 tahun

STAGE 3

1. Research Model

In [ ]:
df_results = pd.DataFrame(columns = ['Method', 'Precision', 'Recall', 'AUC'])
In [ ]:
df_results
In [ ]:
df_results.sort_values('AUC', ascending = False)
Out[ ]:
Method Precision Recall AUC
6 XGB 0.736174 0.935070 0.856998
4 Random Forest 0.734593 0.936575 0.855640
5 ANN 0.723544 0.953317 0.848790
0 Logistic Regression 0.707937 0.969439 0.841565
3 Decision Tree 0.735635 0.906103 0.838452
1 KNN 0.725487 0.890070 0.829802
2 Naive Bayes 0.725487 0.890070 0.829802
In [ ]:
df_results.sort_values('Precision', ascending = False)
Out[ ]:
Method Precision Recall AUC
6 XGB 0.736174 0.935070 0.856998
3 Decision Tree 0.735635 0.906103 0.838452
4 Random Forest 0.734593 0.936575 0.855640
1 KNN 0.725487 0.890070 0.829802
2 Naive Bayes 0.725487 0.890070 0.829802
5 ANN 0.723544 0.953317 0.848790
0 Logistic Regression 0.707937 0.969439 0.841565
In [ ]:
df_results.sort_values('Recall', ascending = False)
Out[ ]:
Method Precision Recall AUC
0 Logistic Regression 0.707937 0.969439 0.841565
5 ANN 0.723544 0.953317 0.848790
4 Random Forest 0.734593 0.936575 0.855640
6 XGB 0.736174 0.935070 0.856998
3 Decision Tree 0.735635 0.906103 0.838452
1 KNN 0.725487 0.890070 0.829802
2 Naive Bayes 0.725487 0.890070 0.829802

Logistic

In [ ]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from scipy.stats import uniform

# Hyperparameters
penalty = ['l2']
C = [0.2,0.22,0.24, 0.26, 0.28, 0.3, 0.32, 0.36]

# Dict
hyperparameters = dict(penalty=penalty, C=C)

classifier = LogisticRegression(random_state = 42)

clf = RandomizedSearchCV(classifier, hyperparameters, cv = 5, random_state=42, scoring='', verbose = 1, n_jobs=-1)
best_model = clf.fit(X_train, y_train)

print(best_model.best_estimator_)

y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)
Fitting 5 folds for each of 8 candidates, totalling 40 fits
/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_search.py:282: UserWarning: The total space of parameters 8 is smaller than n_iter=10. Running 8 iterations. For exhaustive searches, use GridSearchCV.
  % (grid_size, self.n_iter, grid_size), UserWarning)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    8.2s finished
LogisticRegression(C=0.2, random_state=42)
In [ ]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score, precision_score, recall_score
print('\nConfusion matrix')
print(confusion_matrix(y_test, best_model.predict(X_test)))

from sklearn.metrics import accuracy_score
print('\nPrecision')
print(precision_score(y_test, best_model.predict(X_test)))

print('\nRecall')
print(recall_score(y_test, best_model.predict(X_test)))

from sklearn.metrics import classification_report
print('\nClassification report')
print(classification_report(y_test, best_model.predict(X_test))) # generate the precision, recall, f-1 score, num
Confusion matrix
[[ 6774  4515]
 [  345 10944]]

Precision
0.7079371240054337

Recall
0.9694392771724688

Classification report
              precision    recall  f1-score   support

           0       0.95      0.60      0.74     11289
           1       0.71      0.97      0.82     11289

    accuracy                           0.78     22578
   macro avg       0.83      0.78      0.78     22578
weighted avg       0.83      0.78      0.78     22578

In [ ]:
df_results = df_results.append({ 'Method' : 'Logistic Regression',
                               'Precision' : precision_score(y_test, best_model.predict(X_test)),
                               'Recall' : recall_score(y_test, best_model.predict(X_test)),
                               'AUC' : roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1])
                            }, ignore_index = True)
In [ ]:
from sklearn.metrics import roc_curve, auc, roc_auc_score
fpr, tpr, _ = roc_curve(y_test, best_model.predict_proba(X_test)[:,1])

plt.title('Logistic Regression')
plt.xlabel('FPR (Precision)')
plt.ylabel('TPR (Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))
print (roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1]))
Area under curve (AUC):  0.8415650775228899
0.8415650775228899
In [ ]:
import pickle
filename = 'logistic.sav'
pickle.dump(best_model, open(filename, 'wb'))
In [ ]:
filename = 'logistic.sav'
best_model = pickle.load(open(filename, 'rb'))
best_model.best_estimator_
Out[ ]:
LogisticRegression(C=0.2, random_state=42)

KNN

In [ ]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from scipy.stats import uniform

# Hyperparameters
n_neighbors = [3, 5, 7, 9, 11, 13]
metric = ['euclidean', 'manhattan', 'minkowski']
algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute']

# Dict
hyperparameters = dict(n_neighbors=n_neighbors, metric=metric, algorithm = algorithm)

classifier = KNeighborsClassifier()

clf = RandomizedSearchCV(classifier, hyperparameters, cv = 5, random_state=42, scoring='roc_auc', verbose = 1, n_jobs = -1)
best_model = clf.fit(X_train, y_train)

print(best_model.best_estimator_)

y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  4.0min finished
KNeighborsClassifier(algorithm='ball_tree', n_neighbors=11)
In [ ]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score, precision_score, recall_score
print('\nConfusion matrix')
print(confusion_matrix(y_test, best_model.predict(X_test)))

from sklearn.metrics import accuracy_score
print('\nPrecision')
print(precision_score(y_test, best_model.predict(X_test)))

print('\nRecall')
print(recall_score(y_test, best_model.predict(X_test)))

from sklearn.metrics import classification_report
print('\nClassification report')
print(classification_report(y_test, best_model.predict(X_test))) # generate the precision, recall, f-1 score, num
Confusion matrix
[[ 7487  3802]
 [ 1241 10048]]

Precision
0.7254873646209387

Recall
0.8900699796261847

Classification report
              precision    recall  f1-score   support

           0       0.86      0.66      0.75     11289
           1       0.73      0.89      0.80     11289

    accuracy                           0.78     22578
   macro avg       0.79      0.78      0.77     22578
weighted avg       0.79      0.78      0.77     22578

In [ ]:
df_results = df_results.append({ 'Method' : 'KNN',
                               'Precision' : precision_score(y_test, best_model.predict(X_test)),
                               'Recall' : recall_score(y_test, best_model.predict(X_test)),
                               'AUC' : roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1])
                            }, ignore_index = True)
In [ ]:
from sklearn.metrics import roc_curve, auc, roc_auc_score
fpr, tpr, _ = roc_curve(y_test, best_model.predict_proba(X_test)[:,1])

plt.title('KNN')
plt.xlabel('FPR (Precision)')
plt.ylabel('TPR (Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))
print (roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1]))
Area under curve (AUC):  0.829802058781141
0.829802058781141
In [ ]:
import pickle
filename = 'knn.sav'
pickle.dump(best_model, open(filename, 'wb'))
In [ ]:
filename = 'knn.sav'
best_model = pickle.load(open(filename, 'rb'))
best_model.best_estimator_
Out[ ]:
KNeighborsClassifier(algorithm='ball_tree', n_neighbors=11)

Naive Bayes

In [ ]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
y_pred_proba = classifier.predict_proba(X_test)
In [ ]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score, precision_score, recall_score
print('\nConfusion matrix')
print(confusion_matrix(y_test, best_model.predict(X_test)))

from sklearn.metrics import accuracy_score
print('\nPrecision')
print(precision_score(y_test, best_model.predict(X_test)))

print('\nRecall')
print(recall_score(y_test, best_model.predict(X_test)))

from sklearn.metrics import classification_report
print('\nClassification report')
print(classification_report(y_test, best_model.predict(X_test))) # generate the precision, recall, f-1 score, num
Confusion matrix
[[ 7496  3793]
 [ 1088 10201]]

Precision
0.7289552665428041

Recall
0.9036229958366552

Classification report
              precision    recall  f1-score   support

           0       0.87      0.66      0.75     11289
           1       0.73      0.90      0.81     11289

    accuracy                           0.78     22578
   macro avg       0.80      0.78      0.78     22578
weighted avg       0.80      0.78      0.78     22578

In [ ]:
df_results = df_results.append({ 'Method' : 'Naive Bayes',
                               'Precision' : precision_score(y_test, best_model.predict(X_test)),
                               'Recall' : recall_score(y_test, best_model.predict(X_test)),
                               'AUC' : roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1])
                            }, ignore_index = True)
In [ ]:
from sklearn.metrics import roc_curve, auc, roc_auc_score
fpr, tpr, _ = roc_curve(y_test, best_model.predict_proba(X_test)[:,1])

plt.title('Naive Bayes')
plt.xlabel('FPR (Precision)')
plt.ylabel('TPR (Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))
print (roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1]))
Area under curve (AUC):  0.829802058781141
0.829802058781141
In [ ]:
import pickle
filename = 'naiveb.sav'
pickle.dump(classifier, open(filename, 'wb'))
In [ ]:
filename = 'naiveb.sav'
best_model = pickle.load(open(filename, 'rb'))
print("Done")
best_model.best_estimator_
Done
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-105-3c65ae7463ab> in <module>
      2 best_model = pickle.load(open(filename, 'rb'))
      3 print("Done")
----> 4 best_model.best_estimator_

AttributeError: 'GaussianNB' object has no attribute 'best_estimator_'

Decision Tree

In [ ]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

#Hyper Parameter

max_depth = [int(x) for x in np.linspace(1, 110, num = 30)] # Maximum number of levels in tree
min_samples_split = [2, 5, 10, 100] # Minimum number of samples required to split a node
min_samples_leaf = [1, 2, 4, 10, 20, 50] # Minimum number of samples required at each leaf node
max_features = ['auto', 'sqrt'] # Number of features to consider at every split
criterion= ['gini', 'entropy']
# Dict
hyperparameters = {
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'max_features': max_features,
               'criterion' : criterion
                }

classifier = DecisionTreeClassifier(random_state = 42)

clf = RandomizedSearchCV(classifier, hyperparameters, cv = 5, random_state=42, n_iter = 15, scoring='roc_auc', verbose = 1, n_jobs = -1)
best_model = clf.fit(X_train, y_train)

print(best_model.best_estimator_)

y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)
Fitting 5 folds for each of 15 candidates, totalling 75 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done  75 out of  75 | elapsed:    4.3s finished
DecisionTreeClassifier(max_depth=83, max_features='sqrt', min_samples_leaf=10,
                       min_samples_split=100, random_state=42)
In [ ]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score, precision_score, recall_score
print('\nConfusion matrix')
print(confusion_matrix(y_test, best_model.predict(X_test)))

from sklearn.metrics import accuracy_score
print('\nPrecision')
print(precision_score(y_test, best_model.predict(X_test)))

print('\nRecall')
print(recall_score(y_test, best_model.predict(X_test)))

from sklearn.metrics import classification_report
print('\nClassification report')
print(classification_report(y_test, best_model.predict(X_test))) # generate the precision, recall, f-1 score, num
Confusion matrix
[[ 7613  3676]
 [ 1060 10229]]

Precision
0.7356346637900036

Recall
0.9061032863849765

Classification report
              precision    recall  f1-score   support

           0       0.88      0.67      0.76     11289
           1       0.74      0.91      0.81     11289

    accuracy                           0.79     22578
   macro avg       0.81      0.79      0.79     22578
weighted avg       0.81      0.79      0.79     22578

In [ ]:
df_results = df_results.append({ 'Method' : 'Decision Tree',
                               'Precision' : precision_score(y_test, best_model.predict(X_test)),
                               'Recall' : recall_score(y_test, best_model.predict(X_test)),
                               'AUC' : roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1])
                            }, ignore_index = True)
In [ ]:
from sklearn.metrics import roc_curve, auc, roc_auc_score
fpr, tpr, _ = roc_curve(y_test, best_model.predict_proba(X_test)[:,1])

plt.title('Logistic Regression')
plt.xlabel('FPR (Precision)')
plt.ylabel('TPR (Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))
print (roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1]))
Area under curve (AUC):  0.8384523596512945
0.8384523596512945
In [ ]:
importance = best_model.best_estimator_.feature_importances_
feat_importances = pd.Series(importance, index= pd.Series(df.drop('Response', axis = 1).columns))
# feat_importances.plot(kind ="barh")
feat_importances.nlargest(10).plot(kind='barh')
plt.xlabel('score')
plt.ylabel('feature')
plt.title('feature importance score')
Out[ ]:
Text(0.5, 1.0, 'feature importance score')
In [ ]:
import pickle
filename = 'dectree.sav'
pickle.dump(best_model, open(filename, 'wb'))
In [ ]:
filename = 'dectree.sav'
best_model = pickle.load(open(filename, 'rb'))
best_model.best_estimator_
/usr/local/lib/python3.6/dist-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.23.2 when using version 0.22.2.post1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator RandomizedSearchCV from version 0.23.2 when using version 0.22.2.post1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=53, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=50, min_samples_split=100,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

Random forest

In [ ]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

#Hyper Parameter

n_estimators = [int(x) for x in np.linspace(start = 100, stop = 2000, num = 20)] # Number of trees in random forest
max_features = ['auto', 'sqrt', 'log2'] # Number of features to consider at every split
max_depth = [int(x) for x in np.linspace(10, 110, num = 5)] # Maximum number of levels in tree
min_samples_split = [int(x) for x in np.linspace(start = 2, stop = 10, num = 5)] # Minimum number of samples required to split a node
min_samples_leaf = [int(x) for x in np.linspace(start = 1, stop = 10, num = 5)] # Minimum number of samples required at each leaf node
bootstrap = [True, False] # Method of selecting samples for training each tree
n_jobs = [-1]

#Menjadikan ke dalam bentuk dictionary
random_search = {'criterion': ['entropy','gini'],
               'max_depth': max_depth,
               'min_samples_leaf': min_samples_split,
               'min_samples_split': min_samples_leaf,
               'n_estimators': n_estimators,
                'max_features' : max_features}

# random_search = {'criterion': ['entropy','gini'],
#                'max_depth': [10],
#                'min_samples_leaf': [6],
#                'min_samples_split': [7],
#                'n_estimators': [300]}

classifier = RandomForestClassifier(random_state = 42)

clf = RandomizedSearchCV(classifier, random_search, cv = 5, random_state=42, scoring='roc_auc', verbose = 4, n_jobs = -1)
best_model = clf.fit(X_train, y_train)

print(best_model.best_estimator_)

y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:691: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed: 14.6min finished
RandomForestClassifier(criterion='entropy', max_depth=10, max_features='sqrt',
                       min_samples_leaf=8, min_samples_split=7,
                       random_state=42)
In [ ]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score
print('\nConfusion matrix')
print(confusion_matrix(y_test, y_pred))

from sklearn.metrics import accuracy_score
print('\nAccuracy')
print(accuracy_score(y_test, y_pred))

print('\nF1_Score')
print(f1_score(y_test, best_model.predict(X_test)))

from sklearn.metrics import classification_report
print('\nClassification report')
print(classification_report(y_test, y_pred)) # generate the precision, recall, f-1 score, num
Confusion matrix
[[ 7496  3793]
 [ 1088 10201]]

Accuracy
0.783816104172203

F1_Score
0.8233782415699712

Classification report
              precision    recall  f1-score   support

           0       0.87      0.66      0.75     11289
           1       0.73      0.90      0.81     11289

    accuracy                           0.78     22578
   macro avg       0.80      0.78      0.78     22578
weighted avg       0.80      0.78      0.78     22578

In [ ]:
df_results = df_results.append({ 'Method' : 'Random Forest',
                               'Precision' : precision_score(y_test, best_model.predict(X_test)),
                               'Recall' : recall_score(y_test, best_model.predict(X_test)),
                               'AUC' : roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1])
                            }, ignore_index = True)
In [ ]:
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, y_pred_proba[:,1])

plt.title('Random Forest ROC curve: CC Fraud')
plt.xlabel('FPR (Precision)')
plt.ylabel('TPR (Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))
Area under curve (AUC):  0.8556403568033373
In [ ]:
roc_auc_score(y_test, y_pred_proba[:,1])
Out[ ]:
0.8556403568033373
In [ ]:
import pickle
filename = 'rforest1.sav'
pickle.dump(best_model, open(filename, 'wb'))
In [ ]:
filename = 'rforest1.sav'
best_model = pickle.load(open(filename, 'rb'))
best_model.best_estimator_
Out[ ]:
RandomForestClassifier(criterion='entropy', max_depth=10, max_features='sqrt',
                       min_samples_leaf=8, min_samples_split=7,
                       random_state=42)
In [ ]:
importance = best_model.best_estimator_.feature_importances_
feat_importances = pd.Series(importance, index= pd.Series(df.drop('Response', axis = 1).columns))
# feat_importances.plot(kind ="barh")
feat_importances.nlargest(10).plot(kind='barh')
plt.xlabel('score')
plt.ylabel('feature')
plt.title('feature importance score')
Out[ ]:
Text(0.5, 1.0, 'feature importance score')

Artificial Neural Network

In [ ]:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import StratifiedKFold
from keras.metrics import AUC
from sklearn.model_selection import cross_val_score
In [ ]:
def create_baseline():
    model = Sequential()
    model.add(Dense(15, input_dim = 15, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer = 'adam', metrics=[AUC()])
    return model
In [ ]:
cvscores = []
kfold = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 42)
for train, test in kfold.split(X_train,y_train):
    model = create_baseline()
    history = model.fit(X_train, y_train, epochs = 3, batch_size = 32, verbose = 1, validation_data =(X_test,y_test))
    scores = model.evaluate(X_test, y_test, verbose = 1)
    print("\n %s: %.2f%%\n---------\n" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("AUC Result for Testing")
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
# kfold = StratifiedKFold(n_splits = 3, shuffle = True)
# results = cross_val_score(model,  X, y, cv = kfold, scoring = 'roc_auc')
# print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Epoch 1/3
2117/2117 [==============================] - 5s 2ms/step - loss: 0.5156 - auc: 0.7924 - val_loss: 0.4311 - val_auc: 0.8460
Epoch 2/3
2117/2117 [==============================] - 3s 1ms/step - loss: 0.4349 - auc: 0.8403 - val_loss: 0.4294 - val_auc: 0.8476
Epoch 3/3
2117/2117 [==============================] - 3s 1ms/step - loss: 0.4300 - auc: 0.8456 - val_loss: 0.4274 - val_auc: 0.8488
706/706 [==============================] - 1s 1ms/step - loss: 0.4274 - auc: 0.8488

 auc: 84.88%
---------

Epoch 1/3
2117/2117 [==============================] - 4s 2ms/step - loss: 0.4926 - auc_1: 0.8144 - val_loss: 0.4309 - val_auc_1: 0.8454
Epoch 2/3
2117/2117 [==============================] - 4s 2ms/step - loss: 0.4321 - auc_1: 0.8428 - val_loss: 0.4290 - val_auc_1: 0.8466
Epoch 3/3
2117/2117 [==============================] - 4s 2ms/step - loss: 0.4338 - auc_1: 0.8418 - val_loss: 0.4278 - val_auc_1: 0.8481
706/706 [==============================] - 1s 1ms/step - loss: 0.4278 - auc_1: 0.8481

 auc_1: 84.81%
---------

Epoch 1/3
2117/2117 [==============================] - 4s 2ms/step - loss: 0.4937 - auc_2: 0.8141 - val_loss: 0.4298 - val_auc_2: 0.8455
Epoch 2/3
2117/2117 [==============================] - 3s 2ms/step - loss: 0.4308 - auc_2: 0.8422 - val_loss: 0.4280 - val_auc_2: 0.8487
Epoch 3/3
2117/2117 [==============================] - 3s 2ms/step - loss: 0.4314 - auc_2: 0.8445 - val_loss: 0.4281 - val_auc_2: 0.8494
706/706 [==============================] - 1s 967us/step - loss: 0.4281 - auc_2: 0.8494

 auc_2: 84.94%
---------

AUC Result for Testing
84.88% (+/- 0.05%)
In [ ]:
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_4 (Dense)              (None, 15)                240       
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 16        
=================================================================
Total params: 256
Trainable params: 256
Non-trainable params: 0
_________________________________________________________________
In [ ]:
y_pred = model.predict(X_test, batch_size = 32)
y_pred = np.where(y_pred >= 0.5, 1, 0)

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score, precision_score, recall_score
print('\nConfusion matrix')
print(confusion_matrix(y_test, best_model.predict(X_test)))

from sklearn.metrics import accuracy_score
print('\nPrecision')
print(precision_score(y_test, best_model.predict(X_test)))

print('\nRecall')
print(recall_score(y_test, best_model.predict(X_test)))

from sklearn.metrics import classification_report
print('\nClassification report')
print(classification_report(y_test, best_model.predict(X_test))) # generate the precision, recall, f-1 score, num
Confusion matrix
[[ 7469  3820]
 [  716 10573]]

Precision
0.7345932050302231

Recall
0.9365754274072106

Classification report
              precision    recall  f1-score   support

           0       0.91      0.66      0.77     11289
           1       0.73      0.94      0.82     11289

    accuracy                           0.80     22578
   macro avg       0.82      0.80      0.80     22578
weighted avg       0.82      0.80      0.80     22578

In [ ]:
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, model.predict(X_test, batch_size = 32)[:,0])

plt.title('Random Forest ROC curve: CC Fraud')
plt.xlabel('FPR (Precision)')
plt.ylabel('TPR (Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))
Area under curve (AUC):  0.8494722650085131
In [ ]:
import h5py

#Serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
#Serialize weights to HDF5
model.save_weights("ANN.h5")
print("Saved model to disk")
Saved model to disk
In [ ]:
df_results = df_results.append({ 'Method' : 'ANN',
                               'Precision' : precision_score(y_test, y_pred),
                               'Recall' : recall_score(y_test, y_pred),
                               'AUC' : np.mean(cvscores)/100
                            }, ignore_index = True)
In [ ]:
 

XGBoost

In [ ]:
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

#Hyper Paramete
hyperparameters = {
                    "learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ],
                    "max_depth": [ 3, 4, 5, 6, 8, 10, 12, 15],
                    "min_child_weight": [ 1, 3, 5, 7 ],
                    "gamma": [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
                    "colsample_bytree": [ 0.3, 0.4, 0.5 , 0.7 ],
                    "eta":[.3, .2, .1, .05, .01, .005]
                  }

classifier = xgb.XGBClassifier(random_state = 42)

clf = RandomizedSearchCV(classifier, hyperparameters, cv = 5, random_state=42, scoring='roc_auc', verbose = 4, n_jobs = -1)
best_model = clf.fit(X_train, y_train)

print(best_model.best_estimator_)

y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-22-5e34bd1fd38a> in <module>()
     15 
     16 clf = RandomizedSearchCV(classifier, hyperparameters, cv = 5, random_state=42, scoring='roc_auc', verbose = 4, n_jobs = -1)
---> 17 best_model = clf.fit(X_train, y_train)
     18 
     19 print(best_model.best_estimator_)

/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    708                 return results
    709 
--> 710             self._run_search(evaluate_candidates)
    711 
    712         # For multi-metric evaluation, store the best_index_, best_params_ and

/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
   1482         evaluate_candidates(ParameterSampler(
   1483             self.param_distributions, self.n_iter,
-> 1484             random_state=self.random_state))

/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params)
    687                                for parameters, (train, test)
    688                                in product(candidate_params,
--> 689                                           cv.split(X, y, groups)))
    690 
    691                 if len(out) < 1:

/usr/local/lib/python3.6/dist-packages/joblib/parallel.py in __call__(self, iterable)
   1052 
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

/usr/local/lib/python3.6/dist-packages/joblib/parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())

/usr/local/lib/python3.6/dist-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

/usr/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    425                 return self.__get_result()
    426 
--> 427             self._condition.wait(timeout)
    428 
    429             if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

/usr/lib/python3.6/threading.py in wait(self, timeout)
    293         try:    # restore state no matter what (e.g., KeyboardInterrupt)
    294             if timeout is None:
--> 295                 waiter.acquire()
    296                 gotit = True
    297             else:

KeyboardInterrupt: 
In [ ]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score, precision_score, recall_score
print('\nConfusion matrix')
print(confusion_matrix(y_test, best_model.predict(X_test)))

from sklearn.metrics import accuracy_score
print('\nPrecision')
print(precision_score(y_test, best_model.predict(X_test)))

print('\nRecall')
print(recall_score(y_test, best_model.predict(X_test)))

from sklearn.metrics import classification_report
print('\nClassification report')
print(classification_report(y_test, best_model.predict(X_test))) # generate the precision, recall, f-1 score, num
Confusion matrix
[[ 7506  3783]
 [  733 10556]]

Precision
0.7361740707162284

Recall
0.9350695367171583

Classification report
              precision    recall  f1-score   support

           0       0.91      0.66      0.77     11289
           1       0.74      0.94      0.82     11289

    accuracy                           0.80     22578
   macro avg       0.82      0.80      0.80     22578
weighted avg       0.82      0.80      0.80     22578

In [ ]:
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, y_pred_proba[:,1])

plt.title('Random Forest ROC curve: CC Fraud')
plt.xlabel('FPR (Precision)')
plt.ylabel('TPR (Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))
Area under curve (AUC):  0.8569978029373959
In [ ]:
df_results = df_results.append({ 'Method' : 'XGB',
                               'Precision' : precision_score(y_test, best_model.predict(X_test)),
                               'Recall' : recall_score(y_test, best_model.predict(X_test)),
                               'AUC' : roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1])
                            }, ignore_index = True)
In [ ]:
importance = best_model.best_estimator_.feature_importances_
feat_importances = pd.Series(importance, index= pd.Series(df.drop('Response', axis = 1).columns))
# feat_importances.plot(kind ="barh")
feat_importances.nlargest(10).plot(kind='barh')
plt.xlabel('score')
plt.ylabel('feature')
plt.title('feature importance score')
In [ ]:
import pickle
filename = 'xgb.sav'
pickle.dump(best_model, open(filename, 'wb'))
In [ ]:
import pickle
import xgboost as xgb

filename = 'xgb.sav'
best_model = pickle.load(open(filename, 'rb'))
best_model.best_estimator_
Out[ ]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.7, eta=0.005, gamma=0.0,
              gpu_id=-1, importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=4,
              min_child_weight=5, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1,
              objective='binary:logistic', random_state=42, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)
In [ ]:
pip install xgboost --upgrade
Collecting xgboost
  Downloading https://files.pythonhosted.org/packages/2e/57/bf5026701c384decd2b995eb39d86587a103ba4eb26f8a9b1811db0896d3/xgboost-1.3.3-py3-none-manylinux2010_x86_64.whl (157.5MB)
     |████████████████████████████████| 157.5MB 90kB/s 
Requirement already satisfied, skipping upgrade: scipy in /usr/local/lib/python3.6/dist-packages (from xgboost) (1.4.1)
Requirement already satisfied, skipping upgrade: numpy in /usr/local/lib/python3.6/dist-packages (from xgboost) (1.19.5)
Installing collected packages: xgboost
  Found existing installation: xgboost 0.90
    Uninstalling xgboost-0.90:
      Successfully uninstalled xgboost-0.90
Successfully installed xgboost-1.3.3
In [ ]:
from sklearn import tree
import matplotlib.pyplot as plt
from xgboost import plot_tree
fig, ax = plt.subplots(figsize=(80,80))
plot_tree(best_model.best_estimator_, ax = ax, num_trees = 0 )
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa6936fcb70>
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90310 entries, 0 to 90309
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Age                   90310 non-null  float16
 1   Driving_License       90310 non-null  int8   
 2   Previously_Insured    90310 non-null  int8   
 3   Annual_Premium        90310 non-null  float16
 4   Policy_Sales_Channel  90310 non-null  float16
 5   Vintage               90310 non-null  float16
 6   Gender_Male           90310 non-null  int8   
 7   Vehicle_Age_> 1 Year  90310 non-null  int8   
 8   Vehicle_Damage_Yes    90310 non-null  int8   
 9   Region_Code_1         90310 non-null  int8   
 10  Region_Code_2         90310 non-null  int8   
 11  Region_Code_3         90310 non-null  int8   
 12  Region_Code_4         90310 non-null  int8   
 13  Region_Code_5         90310 non-null  int8   
 14  Region_Code_6         90310 non-null  int8   
 15  Response              90310 non-null  int8   
dtypes: float16(4), int8(12)
memory usage: 1.7 MB

Ada Boost

In [ ]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import AdaBoostClassifier

#Hyper Paramete
hyperparameters = {
                   'n_estimators': [10, 50, 100, 500, 1000, 5000],
                   'learning_rate':np.arange(0.1, 2.1, 0.1)
                  }

classifier = AdaBoostClassifier(random_state=42)

clf = RandomizedSearchCV(classifier, hyperparameters, cv = 5, random_state=42, scoring='roc_auc', verbose = 4, n_jobs = -1)
best_model = clf.fit(X_train, y_train)

print(best_model.best_estimator_)

y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  8.4min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed: 25.3min finished
AdaBoostClassifier(learning_rate=0.1, n_estimators=1000, random_state=42)
In [ ]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score, precision_score, recall_score
print('\nConfusion matrix')
print(confusion_matrix(y_test, best_model.predict(X_test)))

from sklearn.metrics import accuracy_score
print('\nPrecision')
print(precision_score(y_test, best_model.predict(X_test)))

print('\nRecall')
print(recall_score(y_test, best_model.predict(X_test)))

from sklearn.metrics import classification_report
print('\nClassification report')
print(classification_report(y_test, best_model.predict(X_test))) # generate the precision, recall, f-1 score, num
Confusion matrix
[[ 7478  3811]
 [  782 10507]]

Precision
0.7338315407179774

Recall
0.9307290282575958

Classification report
              precision    recall  f1-score   support

           0       0.91      0.66      0.77     11289
           1       0.73      0.93      0.82     11289

    accuracy                           0.80     22578
   macro avg       0.82      0.80      0.79     22578
weighted avg       0.82      0.80      0.79     22578

In [ ]:
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, y_pred_proba[:,1])

plt.title('Random Forest ROC curve: CC Fraud')
plt.xlabel('FPR (Precision)')
plt.ylabel('TPR (Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))
Area under curve (AUC):  0.854421268245849
In [ ]:
importance = best_model.best_estimator_.feature_importances_
feat_importances = pd.Series(importance, index= pd.Series(df.drop('Response', axis = 1).columns))
# feat_importances.plot(kind ="barh")
feat_importances.nlargest(10).plot(kind='barh')
plt.xlabel('score')
plt.ylabel('feature')
plt.title('feature importance score')
Out[ ]:
Text(0.5, 1.0, 'feature importance score')
In [ ]:
df_results = df_results.append({ 'Method' : 'AdaBoost',
                               'Precision' : precision_score(y_test, best_model.predict(X_test)),
                               'Recall' : recall_score(y_test, best_model.predict(X_test)),
                               'AUC' : roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1])
                            }, ignore_index = True)
In [ ]:
import pickle
filename = 'ada.sav'
pickle.dump(best_model, open(filename, 'wb'))
In [ ]:
 

2. Selected Model Validation

In [ ]:
train = pd.read_csv('data-stage1-31012021.csv')
# split independent variabel dan dependent variabel
X = train.drop('Response', axis = 1).values
y = train['Response'].values
# split test dan train
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42, stratify = y, shuffle = True)
In [ ]:
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
roc_auc_list = []
roc_auc_holdout = []
roc_auc_train = []
folds = []

# Sudah di tuning
# model = xgb.XGBClassifier(colsample_bytree=0.7, eta=0.2, gamma=0.4,
#                           learning_rate=0.05, max_depth=8,
#                           min_child_weight=5,
#                           n_estimators=100, n_jobs=-1)
model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.7, eta=0.005, gamma=0.0,
              gpu_id=-1, importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=4,
              min_child_weight=5, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1,
              objective='binary:logistic', random_state=42, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)
# model = xgb.XGBClassifier()
kfold = StratifiedKFold(n_splits= 10, random_state = 42,shuffle =True)
for i,(train_index, test_index) in enumerate(kfold.split(X_train, y_train)):
    X1_train, X1_valid = X[train_index], X[test_index]
    y1_train, y1_valid = y[train_index], y[test_index]
    model.fit(X1_train, y1_train)
    train_pred = model.predict_proba(X1_train)[:,1] # 70%
    #Measure of the fit of your model.
    pred = model.predict_proba(X1_valid)[:,1] # 10%
    # DATA WHICH MODEL HAS NOT SEEN
    pred_holdout = model.predict_proba(X_test)[:,1] # 20%
    
    print('Prediction length on validation set, XGBoost Classifier, fold ', i, ': ', len(pred))

    folds.append(i)
    roc_auc_list.append(roc_auc_score(y1_valid, pred))
    roc_auc_holdout.append(roc_auc_score(y_test, pred_holdout))
    roc_auc_train.append(roc_auc_score(y1_train, train_pred))
[12:28:20] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Prediction length on validation set, XGBoost Classifier, fold  0 :  6774
[12:28:23] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Prediction length on validation set, XGBoost Classifier, fold  1 :  6774
[12:28:26] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Prediction length on validation set, XGBoost Classifier, fold  2 :  6773
[12:28:29] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Prediction length on validation set, XGBoost Classifier, fold  3 :  6773
[12:28:32] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Prediction length on validation set, XGBoost Classifier, fold  4 :  6773
[12:28:35] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Prediction length on validation set, XGBoost Classifier, fold  5 :  6773
[12:28:38] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Prediction length on validation set, XGBoost Classifier, fold  6 :  6773
[12:28:41] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Prediction length on validation set, XGBoost Classifier, fold  7 :  6773
[12:28:44] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Prediction length on validation set, XGBoost Classifier, fold  8 :  6773
[12:28:47] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Prediction length on validation set, XGBoost Classifier, fold  9 :  6773
In [ ]:
rg = np.arange(0.840,0.870,0.005)

train_mean = np.mean(roc_auc_train)
test_mean = np.mean(roc_auc_holdout)
val_mean = np.mean(roc_auc_list)

train_std = np.std(roc_auc_train)
test_std = np.std(roc_auc_holdout)
val_std = np.std(roc_auc_list)

plt.style.use('tableau-colorblind10')

fig, ax = plt.subplots(figsize=(20,10))
ax.plot(roc_auc_train, label='Train', marker='o', linestyle='-.')
ax.plot(roc_auc_holdout, label='Test', marker='o', linestyle=':')
ax.plot(roc_auc_list, label='Val', marker='o', linestyle='--')

text_m = '''
    * Train Mean : ''' + str(format(train_mean, '.5f')) + '''
    * Test Mean : ''' + str(format(test_mean, '.5f')) + ''' 
    * Val Mean : ''' + str(format(val_mean, '.5f')) + '''     
'''

ax.text(6,0.841,text_m,horizontalalignment='left',color='black',fontsize=16,fontweight='normal')


text_s = '''
    * Train Standard Deviation : ''' + str(format(train_std, '.5f')) + '''
    * Test Standard Deviation : ''' + str(format(test_std, '.5f')) + ''' 
    * Val Standard Deviation : ''' + str(format(val_std, '.5f')) + '''     
'''

ax.text(0.5,0.841,text_s,horizontalalignment='left',color='black',fontsize=16,fontweight='normal')


ax.set_xlabel('No of variable at each split', fontsize=18, labelpad=20)
ax.set_ylabel('ROC_AUC Score', fontsize=18, labelpad=10)

ax.set_title('XGBoost - Train, Test, Val Error', pad=20, fontsize=30)

ax.legend()
ax.set_yticks(rg)

sns.despine()

plt.savefig('./xgb-ttv.jpg')

plt.tight_layout()

plt.show();